(57) Abstract

A Very Long Instruction Word (VLIW) processor having a plurality of functional units includes a
multi-ported register file that is divided into a plurality of separate register file segments, each of the
register file segments being associated to one of the plurality of functional units. The register file
segments are partitioned into local registers and global registers. The global registers are read and
written by all functional units. The local registers are read and written only by a functional unit associated
with a particular register file segment. The local registers and global registers are addressed using register
addresses in an address space that is separately defined for a register file segment/functional unit pair. The
global registers are addressed within a selected global register range using the same register addresses for
the plurality of register file segment/functional unit pairs. The local registers in a register file segment are
addressed using register addresses in a local register range outside the global register range that are
assigned within a single register file segment/functional unit pair. Register addresses in the local register
range are the same for the plurality of register file segment/functional unit pairs and address registers
locally within a register file segment/functional unit pair.

LOCAL AND GLOBAL REGISTER PARTITIONING IN A VLIW PROCESSOR
TECHNICAL FIELD 

The present invention relates to storage or memory in a processor. More specifically, the present 
invention relates to a storage having local and global access regions for subinstructions in a Very Long
Instruction Word (VLIW) processor. 

BACKGROUND ART 

One technique for improving the performance of processors is parallel execution of multiple instructions
to allow the instruction execution rate to exceed the clock rate. Various types of parallel processors have been
developed including Very Long Instruction Word (VLIW) processors that use multiple, independent functional
units to execute multiple instructions in parallel. VLIW processors package multiple operations into one very
long instruction, the multiple operations being determined by sub-instructions that are applied to the independent
functional units. An instruction has a set of fields corresponding to each functional unit. Typical bit lengths of a
subinstruction commonly range from 16 to 24 bits per functional unit to produce an instruction length often in a
range from 112 to 168 bits.

The multiple functional units are kept busy by maintaining a code sequence with sufficient operations to
keep instructions scheduled. A VLIW processor often uses a technique called trace scheduling to maintain
scheduling efficiency by unrolling loops and scheduling code across basic function blocks. Trace scheduling
also improves efficiency by allowing instructions to move across branch points.

Limitations of VLIW processing include limited parallelism, limited hardware resources, and a vast
increase in code size. A limited amount of parallelism is available in instruction sequences. Unless loops are
unrolled a very large number of times, insufficient operations are available to fill the instructions. Limited
hardware resources are a problem, not only because of duplication of functional units but more importantly due
to a large increase in memory and register file bandwidth. A large number of read and write ports are necessary
for accessing the register file, imposing a bandwidth that is difficult to support without a large cost in the size of
the register file and degradation in clock speed. As the number of ports increases, the complexity of the memory
system further increases. To allow multiple memory accesses in parallel, the memory is divided into multiple
banks having different addresses to reduce the likelihood that multiple operations in a single instruction have
conflicting accesses that cause the processor to stall since synchrony must be maintained between the functional
units.



Code size is a problem for several reasons. The generation of sufficient operations in a nonbranching
code fragment requires substantial unrolling of loops, increasing the code size. Also, instructions that are not full
may include unused subinstructions that waste code space, increasing code size. Furthermore, the increase in the
size of storages such as the register file increases the number of bits in the instruction for addressing registers in
the register file.

A register file with a large number of registers is often used to increase performance of a VLIW 
processor. A VLIW processor is typically implemented as a deeply pipelined engine with an "in-order" 
execution model. To attain a high performance a large number of registers is utilized so that the multiple 
functional units are busy as often as possible. 

A large register file has several drawbacks. First, as the number of registers that are directly addressable 
is increased, the number of bits used to specify the multiple registers within the instruction increases 
proportionally. For a rich instruction set architecture with, for example, four register specifiers, an additional bit 
for a register specifier effectively costs four bits per subinstruction (one bit per register specifier). For a VLIW
word with four to eight subinstructions, sixteen to thirty-two bits are added for instruction encoding. Second, a
register file with many registers occupies a large area. Third, a register file with many registers may create 
critical timing paths and therefore limit the cycle time of the processor. 
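
The encoding cost described in the first drawback can be checked with a short calculation. The following is a minimal sketch only; the function name extra_instruction_bits and its parameter names are hypothetical and introduced here purely for illustration.

```python
def extra_instruction_bits(specifiers_per_sub: int, subs_per_word: int, extra_bits: int = 1) -> int:
    """Bits added to one VLIW word when every register specifier grows by extra_bits."""
    return extra_bits * specifiers_per_sub * subs_per_word


# Four register specifiers per subinstruction, four to eight subinstructions per word.
low = extra_instruction_bits(specifiers_per_sub=4, subs_per_word=4)
high = extra_instruction_bits(specifiers_per_sub=4, subs_per_word=8)
print(f"one extra address bit costs {low} to {high} bits per VLIW word")
```

With four specifiers per subinstruction, each additional address bit therefore costs between sixteen and thirty-two instruction bits, which matches the range stated above.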

What is needed is a technique and processor architecture enhancement that improves the efficiency of 
instruction coding but still allows access to a large set of architecturally-visible registers.

DISCLOSURE OF INVENTION 

A Very Long Instruction Word (VLIW) processor having a plurality of functional units includes a
multi-ported register file that is divided into a plurality of separate register file segments, each of the register file
segments being associated to one of the plurality of functional units. The register file segments are partitioned
into local registers and global registers. The global registers are read and written by all functional units. The
local registers are read and written only by a functional unit associated with a particular register file segment.
The local registers and global registers are addressed using register addresses in an address space that is
separately defined for a register file segment/functional unit pair. The global registers are addressed within a
selected global register range using the same register addresses for the plurality of register file
segment/functional unit pairs. The local registers in a register file segment are addressed using register
addresses in a local register range outside the global register range that are assigned within a single register file
segment/functional unit pair. Register addresses in the local register range are the same for the plurality of
register file segment/functional unit pairs and address registers locally within a register file segment/functional
unit pair.

A VLIW processor utilizes a very long instruction word that includes a plurality of subinstructions. The
subinstructions are allocated into positions of the instruction word. The VLIW processor includes a register file
that is divided into a plurality of register file segments. The VLIW processor also includes a plurality of 
functional units, each of which is coupled to and associated with a register file segment of the register file. Each 
[...]

[...] segments, each having a reduced number of read and/or write ports in comparison to a nonduplicated register
file, but each having the same number of physical registers. [...]
having NG + (M * NL) total registers available for the M subinstructions. [...]
addressing the NG + (M * NL) total registers remains equal to the number of bits B that are used to address N [...]

In one example, each of M equal to four register file segments includes N equal to 128 registers. [...]
192-223 in register file segment 3 are all addressed using register addresses 96-127.

[...] subinstruction and a savings of 16 bits for a VLIW instruction. The reduction in address bits is highly
advantageous in a VLIW processor that includes powerful functional units that execute a large plurality of
instructions, each of which is to be encoded in the VLIW instruction word.

[...] register file segments each having 128 registers may be programmably configured as a flat register file with
128 global registers and 0 local registers with the 128 registers addressed using seven address bits. [...]
registers so that the total number of registers is 64 + (4 * 64) = 320 registers that are [...]
otherwise be required to address 320 registers.

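
The addressing arithmetic in the examples above can be sketched as follows. This is an illustrative model under stated assumptions, not the disclosed circuit: the split of 96 global and 32 local registers is inferred from the local address range 96-127 in a 128-register segment, and the flat "physical index" numbering, together with the names total_registers, address_bits, and physical_index, is hypothetical.

```python
def total_registers(num_global: int, num_local: int, segments: int) -> int:
    """Distinct registers visible across all segments: globals once, locals per segment."""
    return num_global + segments * num_local


def address_bits(count: int) -> int:
    """Bits needed to give each of `count` registers a unique flat address."""
    return (count - 1).bit_length()


def physical_index(segment: int, address: int, num_global: int = 96, num_local: int = 32) -> int:
    """Map a per-segment register address to a flat index (hypothetical numbering)."""
    if not 0 <= address < num_global + num_local:
        raise ValueError("address outside the segment's address space")
    if address < num_global:
        return address  # global: same register regardless of segment
    return num_global + segment * num_local + (address - num_global)


# Assumed 96 global / 32 local split, four segments of 128 registers each.
print(total_registers(96, 32, 4), address_bits(128), address_bits(total_registers(96, 32, 4)))
# Programmable 64 global / 64 local split from the text.
print(total_registers(64, 64, 4), address_bits(total_registers(64, 64, 4)))
```

Under these assumptions, address 96 in segment 3 maps to flat index 192 and address 127 maps to 223, so seven address bits per specifier reach a register set that would otherwise need eight bits (or nine bits for the 64/64 configuration).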
BRIEF DESCRIPTION OF DRAWINGS

The features of the described embodiments are specifically set forth in the appended claims. However, 
embodiments of the invention relating to both structure and method of operation, may best be understood by 
referring to the following description and accompanying drawings. 

FIGURE 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a 
processor in accordance with an embodiment of the present invention. 

FIGURE 2 is a schematic block diagram showing the core of the processor. 

FIGURE 3 is a schematic block diagram that illustrates an embodiment of the split register file that is 
suitable for usage in the processor. 

FIGURE 4 is a schematic block diagram that shows a logical view of the register file and functional 
units in the processor. 

FIGURES 5A, 5B, and 5C show a schematic block diagram of a divided or split register file, a high
level view of computation elements of a functional unit, and a pictorial view of an instruction format,
respectively, which are used to illustrate the difficulty of defining an instruction format with a limited number of
instruction bits.

FIGURE 6 is a schematic block diagram showing a register file for a VLIW processor that includes
global and local register partitioning. 

FIGURE 7 illustrates a schematic block diagram of an SRAM array used for the multi-port split register
file.

FIGURES 8A and 8B are, respectively, a schematic block diagram and a pictorial diagram that illustrate
the register file and a memory array insert of the register file. 

FIGURE 9 is a schematic block diagram showing an arrangement of the register file into the four 
register file segments. 

FIGURE 10 is a schematic timing diagram that illustrates timing of the processor pipeline.

The use of the same reference symbols in different drawings indicates similar or identical items.

MODES FOR CARRYING OUT THE INVENTION

Referring to FIGURE 1, a schematic block diagram illustrates a single integrated circuit chip
implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two
media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface
controllers support an interactive graphics environment with real-time constraints [...]
components of memory, graphics, and input/output bridge functionality on a single die. The components are
mutually linked and closely linked to the processor core with high bandwidth, low-latency communication
channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The
interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral
component interconnect (PCI) controller 120. [...] RAM (DRDRAM) controller. The shared data cache 106 is a
dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each
media processing unit. The data cache 106 is [...] way set associative, follows a write-back protocol, and
supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need
for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.

[...] and main memory initiation in parallel pipelines for each coherent [...]
116 attains reduced latency on cache misses and improved utilization of address, datapath, and main memory
in comparison to directory-based systems. Directory-based systems maintain coherence states for each
data block in main memory and require read-modify-write penalty for every read transaction [...]
memory. The UPA controller 116 is a centralized system controller that removes the need to place cache
coherence logic on the processor 100 and DMA devices, thereby simplifying the circuitry.

The PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume,
low-cost peripheral devices, although other standard interfaces may also be used. The PCI bus
effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM
players, DVD players, and digital cameras.

Two media processing units 110 and 112 are included in a single integrated circuit chip to support an
execution environment exploiting thread level parallelism in which two independent threads can execute
simultaneously. The threads may arise from any sources such as the same application, different [...]
two, instructions per cycle in general purpose code. For example, the illustrative processor 100
is an eight-wide machine with eight execution units for [...]
processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight
execution units would be idle at any time. The illustrative processor 100 employs thread level parallelism and
operates on two independent threads, possibly attaining twice the performance of a processor having the same
resources and clock rate but utilizing traditional non-thread parallelism.

Thread level parallelism is particularly useful for Java™ applications which are bound to have multiple
threads of execution. Java™ methods including "suspend", "resume", "sleep", and the like include effective
support for threaded program code. In addition, Java™ class libraries are thread-safe to promote parallelism.
(Java™, Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems,
Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC
II, are used under license and are trademarks of SPARC International, Inc. in the United States and other
countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems,
Inc.) Furthermore, the thread model of the processor 100 supports a dynamic compiler which runs as a separate
thread using one media processing unit 110 while the second media processing unit 112 is used by the current
application. In the illustrative system, the compiler applies optimizations based on "on-the-fly" profile feedback
information while dynamically modifying the executing code to improve execution on each subsequent run. For
example, a "garbage collector" may be executed on a first media processing unit 110, copying objects or
gathering pointer information, while the application is executing on the other media processing unit 112.

Although the processor 100 shown in FIGURE 1 includes two processing units on an integrated circuit
chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a
message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus,
in the processor 100, a limitation on the number of processors formed on a single die arises from capacity
constraints of integrated circuit technology rather than from architectural constraints relating to the interactions
and interconnections between processors.

Referring to FIGURE 2, a schematic block diagram shows the core of the processor 100. The media
processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction
buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit
218. In the illustrative processor 100, the media processing units 110 and 112 use a plurality of execution units
for executing instructions. The execution units for a media processing unit 110 include three media functional
units (MFU) 220 and one general functional unit (GFU) 222. The media functional units 220 are multiple
single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is
capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the
single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare,
and the like. The media functional units 220 operate in combination as tightly-coupled digital signal processors
(DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three
media functional units 220 execute synchronously so that the subinstructions progress lock-step through pipeline
stages.

The general functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU)
operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power
operations, reciprocal square root operations, and many others. The general functional unit 222 supports less
common parallel operations such as the parallel reciprocal square root instruction.

The illustrative instruction cache 210 has a 16 Kbyte capacity and includes hardware support to
maintain coherence, allowing dynamic optimizations through self-modifying code. Software is used to indicate
that the instruction storage is being modified when modifications occur. The 16K [...]
maintained by hardware that supports write-through, non-allocating caching. Self-modifying code is supported
through explicit use of "store-to-instruction-space" instructions [...]. Software uses the [...]
maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped
on every single store operation issued by the media processing unit 110.

The pipeline control unit 226 is connected between the instruction buffer 214 and the functional units
and schedules the transfer of instructions to the functional units. The pipeline control unit 226 receives
status signals from the functional units and the load/store unit 218 and uses the status signals to perform several
control functions. The pipeline control unit 226 maintains a scoreboard, generates stalls and bypass controls.
The pipeline control unit 226 also generates traps and maintains special registers.

Each media processing unit 110 and 112 includes a split register file 216, a single logical register file
including 128 thirty-two bit registers. The split register file 216 is split into a plurality of register file segments
[...] general functional unit 222. In the illustrative embodiment, each register file segment 224 has 128 32-bit
registers. The first 96 registers [...] in the register file segment 224 are global registers. All functional units
can write to the 96 global registers. The global registers are coherent across all functional units (MFU and GFU)
so that any write operation to a global register by any functional unit is broadcast to all register file segments 224.
[...] are not accessible or "visible" to other functional units.
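
The write behavior described above (global writes broadcast to every segment, local registers private to one functional unit) can be modeled with a small sketch. The class name SplitRegisterFile, its methods, and the default of 96 global registers per 128-register segment are assumptions made for illustration; the sketch models only visibility, not ports or timing.

```python
class SplitRegisterFile:
    """Behavioral model: one register array per functional unit (hypothetical API)."""

    def __init__(self, segments: int = 4, size: int = 128, num_global: int = 96):
        self.num_global = num_global
        self.segments = [[0] * size for _ in range(segments)]

    def write(self, unit: int, address: int, value: int) -> None:
        if address < self.num_global:
            # Global register: broadcast so every segment holds the same value.
            for segment in self.segments:
                segment[address] = value
        else:
            # Local register: only the writing unit's own segment changes.
            self.segments[unit][address] = value

    def read(self, unit: int, address: int) -> int:
        # A unit only ever reads its own segment.
        return self.segments[unit][address]


rf = SplitRegisterFile()
rf.write(unit=1, address=10, value=7)    # global write, visible to all units
rf.write(unit=2, address=100, value=9)   # local write, visible to unit 2 only
print([rf.read(u, 10) for u in range(4)], [rf.read(u, 100) for u in range(4)])
```

Because each unit reads only its own segment, broadcasting global writes is what keeps the segments logically equivalent to a single register file for the global range.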

The media processing units 110 and 112 are highly structured computation blocks that execute
software-scheduled data computation operations with fixed, deterministic and relatively short instruction [...]
support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids
[...] are typically complex, error-prone, and create multiple critical paths. A VLIW instruction word always
includes one instruction that executes in the general functional unit (GFU) 222 and from zero to three instructions
that execute in the media functional units (MFU) 220. A MFU instruction field within the VLIW instruction word
includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register
field.

Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to
other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that
data can be streamed from main memory. The execution model eliminates the usage and overhead resources of
an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering.
Elimination of the instruction ordering structures and overhead resources is highly advantageous since the
eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated
blocks consume about 30% of the die area of a Pentium II processor.

To avoid software scheduling errors, the media processing units 110 and 112 are high-performance but
simplified with respect to both compilation and execution. The media processing units 110 and 112 are most
generally classified as a simple 2-scalar execution engine with full bypassing and hardware interlocks on load
operations. The instructions include loads, stores, arithmetic and logic (ALU) instructions, and branch
instructions so that scheduling for the processor 100 is essentially equivalent to scheduling for a simple 2-scalar
execution engine for each of the two media processing units 110 and 112.

The processor 100 supports full bypasses between the first two execution units within the media
processing unit 110 and 112 and has a scoreboard in the general functional unit 222 for load operations so that
the compiler does not need to handle nondeterministic latencies due to cache misses. The processor 100
scoreboards long latency operations that are executed in the general functional unit 222, for example a reciprocal
square-root operation, to simplify scheduling across execution units. The scoreboard (not shown) operates by
tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the
instruction is finished and the result becomes available. A VLIW instruction packet contains one GFU
instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an
incoming VLIW instruction packet are checked against the scoreboard. Any true dependencies or output
dependencies stall the entire packet until the result is ready. Use of a scoreboarded result as an operand causes
instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the
referencing instruction that provokes the stall executes on the general functional unit 222 or the first media
functional unit 220, then the stall only endures until the result is available for intra-unit bypass. For the case of a
load instruction that hits in the data cache 106, the stall may last only one cycle. If the referencing instruction is
on the second or third media functional units 220, then the stall endures until the result reaches the writeback
stage in the pipeline where the result is bypassed in transmission to the split register file 216.
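
The packet-level dependency check described above can be sketched as follows. This is a simplified model under stated assumptions: the names SubInstruction and Scoreboard are hypothetical, registers are plain integers, and bypass timing (how long a stall lasts on each functional unit) is not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubInstruction:
    sources: tuple  # source register numbers
    dest: int       # destination register number


@dataclass
class Scoreboard:
    pending: set = field(default_factory=set)  # destinations of unfinished operations

    def issue_long_latency(self, dest: int) -> None:
        """Record a load or long-latency result that is not yet available."""
        self.pending.add(dest)

    def complete(self, dest: int) -> None:
        """Clear the entry once the result becomes available."""
        self.pending.discard(dest)

    def must_stall(self, packet) -> bool:
        """Stall the whole packet on any true (read) or output (write) dependency."""
        for sub in packet:
            if any(src in self.pending for src in sub.sources):
                return True  # true dependency on a scoreboarded result
            if sub.dest in self.pending:
                return True  # output dependency on a scoreboarded result
        return False


sb = Scoreboard()
sb.issue_long_latency(5)
packet = [SubInstruction((1, 2), 3), SubInstruction((5, 6), 7)]
print(sb.must_stall(packet))
sb.complete(5)
print(sb.must_stall(packet))
```

Checking the packet as a unit reflects the statement that a single dependency stalls the entire packet, since the subinstructions of a VLIW word issue together.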



The scoreboard automatically manages load delays that occur during a load hit. In an illustrative
embodiment, all loads enter the scoreboard to simplify software scheduling and [...] NOPs in the instruction
stream.



The scoreboard is used to manage most interlocks between the general functional unit 222 and the
media functional units 220. All loads and non-pipelined long-latency operations of the general functional unit
222 are scoreboarded. The long-latency operations include division [...] instructions, reciprocal square root
[...]. None of the [...] is scoreboarded. Non-scoreboarded [...]
produces the results following the latency of the instruction.

The illustrative processor 100 has a rendering rate of over fifty million triangles per second without
accounting for operating system overhead. Therefore, data feeding specifications of the processor 100 are far
[...] compressed geometry using the geometry decompressor 104, an on-chip real-time geometry decompression
engine. Data geometry is stored in main memory in a compressed format. At render time, the data geometry is
fetched and decompressed in real-time on the integrated circuit of the processor 100. The geometry
decompressor 104 advantageously saves memory space and memory transfer bandwidth. The compressed
geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between
triangles, allowing the processor 100 to transform and light most vertices only once. In a typical compressed
mesh, the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the
throughput for isolated triangles. For example, during processing of triangles, multiple vertices are operated
upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining.
Thus operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping
several loop iterations in time. For other types of applications with high instruction level parallelism, high trip
count loops are software-pipelined so that most media functional units 220 are fully utilized.

Referring to FIGURE 3, a schematic block diagram illustrates an embodiment of the split register file
216 that is suitable for usage in the processor 100. The split register file 216 supplies all operands of processor
instructions that execute in the media functional units 220 and the general functional unit 222 and receives
results of the instruction execution from the execution units. The split register file 216 operates as an interface to
the geometry decompressor 104. The split register file 216 is the source and destination of store and load
operations, respectively.

In the illustrative processor 100, the split register file 216 in each of the media processing units 110 and
112 has 128 registers. Graphics processing places a heavy burden on register usage. Therefore, a large number
of registers is supplied by the register file 216 so that performance is not limited by loads and stores or
handling of intermediate results including graphics "fills" and "spills". The illustrative split register file 216
includes twelve read ports and five write ports, supplying total data read and write capacity between the central
registers of the split register file 216 and all media functional units 220, the general functional unit 222 and the
load/store unit 218 that is connected to the general functional unit 222. The five write ports include one 64-bit
write port that is dedicated to load operations. The remaining four write ports are 32 bits wide and are used for
write operations of the general functional unit 222 and the media functional units 220.

WO 00/33178 PCT/US99/28820

Total read and write capacity promotes flexibility and facility in programming both of hand-coded 
routines and compiler-generated code. 

Large, multiple-ported register files are typically metal-limited so that the register area is proportional
with the square of the number of ports. A sixteen port file is roughly proportional in size and speed to a value of
256. The illustrative split register file 216 is divided into four register file segments 310, 312, 314, and 316, each
having three read ports and four write ports so that each register file segment has a size and speed proportional to
49 for a total area for the four segments that is proportional to 196. The total area is therefore potentially smaller
and faster than a single central register file. Write operations are fully broadcast so that all files are maintained
coherent. Logically, the split register file 216 is no different from a single central register file. However, from
the perspective of layout efficiency, the split register file 216 is highly advantageous, allowing for reduced size
and improved performance through faster access.
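
The area comparison above follows directly from the stated square-law relationship. A minimal sketch of the arithmetic, in which the function name relative_area is hypothetical and the result is a unitless proportionality value rather than a physical area:

```python
def relative_area(ports: int) -> int:
    """Relative register file area for a metal-limited design: ports squared."""
    return ports ** 2


central = relative_area(16)        # single sixteen-port file
segment = relative_area(3 + 4)     # one segment: three read ports plus four write ports
split_total = 4 * segment          # four such segments

print(central, segment, split_total)
```

The four seven-port segments together come to 196 against 256 for the single sixteen-port file, which is the basis for the claim that the split arrangement is potentially smaller.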

The new media data that is operated upon by the processor 100 is typically heavily compressed. Data
transfers are communicated in a compressed format from main memory and input/output devices to pins of the
processor 100, subsequently decompressed on the integrated circuit holding the processor 100, and passed to the
split register file 216.

Splitting the register file into multiple segments in the split register file 216 in combination with the
character of data accesses in which multiple bytes are transferred to the plurality of execution units concurrently,
results in a high utilization rate of the data supplied to the integrated circuit chip and effectively leads to a much
higher data bandwidth than is supported on general-purpose processors. The highest data bandwidth requirement
is therefore not between the input/output pins and the central processing units, but is rather between the
decompressed data source and the remainder of the processor. For graphics processing, the highest data
bandwidth requirement is between the geometry decompressor 104 and the split register file 216. For video
decompression, the highest data bandwidth requirement is internal to the split register file 216. Data transfers
between the geometry decompressor 104 and the split register file 216 and data transfers between various
registers of the split register file 216 can be wide and run at processor speed, advantageously delivering a large
bandwidth. In addition, the split register file 216 can be multiported which further increases total bandwidth.

The register file 216 is a focal point for attaining the very large bandwidth of the processor 100. The 
processor 100 transfers data using a plurality of data transfer techniques. In one example of a data transfer 
technique, cacheable data is loaded into the split register file 216 through normal load operations at a low rate of
up to eight bytes per cycle. In another example, streaming data is transferred to the split register file 216 through
group load operations which transfer thirty-two bytes from memory directly into eight consecutive 32-bit
registers. For example, the processor 100 utilizes the streaming data operation to receive compressed video data
for decompression.
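
The group load described above moves thirty-two bytes into eight consecutive 32-bit registers. The sketch below illustrates only that data layout; the function name group_load, the list-based register model, and the big-endian byte order are assumptions, since the text does not specify byte ordering or an instruction mnemonic.

```python
import struct


def group_load(registers: list, base: int, data: bytes) -> None:
    """Copy 32 bytes into eight consecutive 32-bit registers starting at `base`."""
    if len(data) != 32:
        raise ValueError("a group load transfers exactly thirty-two bytes")
    if base < 0 or base + 8 > len(registers):
        raise ValueError("eight consecutive registers must fit in the register file")
    # Assumed big-endian: each group of four bytes becomes one 32-bit word.
    registers[base:base + 8] = struct.unpack(">8I", data)


regs = [0] * 128
group_load(regs, base=8, data=bytes(range(32)))
print([hex(word) for word in regs[8:16]])
```

A single group load in this model fills as many register bytes as four cycles of normal loads at the stated maximum of eight bytes per cycle.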

Compressed graphics data is received via a direct memory access (DMA) unit in the geometry
decompressor 104. The compressed graphics data is decompressed by the geometry decompressor 104 and
loaded at a high bandwidth rate into the split register file 216 via group load operations that are mapped to the
geometry decompressor 104.

Load operations are non-blocking and scoreboarded so that a long latency inherent to loads can be 
hidden by early scheduling. 

General purpose applications often fail to exploit the large register file 216. Statistical analysis shows
that compilers do not effectively use the large number of registers in the split register file 216. However,
aggressive in-lining techniques that have traditionally been restricted due to the limited number of registers [...]
the split register file 216. In a software system that exploits the large number of registers in the processor 100,
the complete set of registers is saved upon the event of a thread (context) switch. When only a few registers of
the entire set of registers is used, saving all registers in the full thread switch is wasteful. Waste is avoided in the
processor 100 by supporting individual marking of registers. Octants of the thirty-two registers can be marked as
"dirty" if used, and are consequently saved conditionally.

In various embodiments, the split register file 216 is leveraged by dedicating fields for globals, trap
registers, and the like.

Referring to FIGURE 4, a schematic block diagram shows a logical view of the register file 216 and
functional units in the processor 100. The physical implementation of the core processor 100 is simplified by
replicating a single functional unit to form the three media functional units 220. The media functional units 220
include circuits that execute various arithmetic and logical operations including general-purpose code, graphics
code, and video-image-speech (VIS) processing. VIS processing includes video processing, image processing,
digital signal processing (DSP) loops, speech processing, and voice recognition algorithms, for example.

Referring to FIGURES 5A, 5B, and 5C, a schematic block diagram of a divided or split register file, a
high level view of computation elements of a functional unit, and a pictorial view of an instruction format,
respectively are used to illustrate the difficulty of defining an instruction format with a limited number of
instruction bits. FIGURE 5A shows a schematic block diagram of a decoder 502 that decodes four
subinstructions of a very long instruction word. Each of the four decoders applies control signals to one of four
register file segments 510, 512, 514, and 516. Each of the register file segments is coupled to and associated 
with a functional unit. In the illustrative embodiment, a first register file segment 510 is coupled to and 
associated with a general functional unit 520. Second, third, and fourth register file segments 512, 514, and 516 
are respectively coupled to and associated with media functional units 522, 524, and 526. 

FIGURE 5B shows an example of a VLIW subinstruction, specifically a multiply-add (muladd)
instruction and relates execution of the muladd instruction to computation blocks in a functional unit. The
muladd instruction specifies four register specifiers designating data that is operated upon by the functional unit.
The muladd instruction specifies three source operands R_A, R_B, and R_C, and one destination operand R_D. The
functional unit includes a multiplier 530 that multiplies the source operands R_A and R_B to generate a product.
The functional unit also includes an adder 532 that receives the product from the multiplier 530 and adds the
product and the source operand R_C to produce a sum that is transferred to the destination register operand R_D.
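
The data flow of the muladd subinstruction can be sketched in a few lines. This is only an illustrative model, assuming a register file represented as a plain Python list and integer operands; the function name `muladd` and the register numbers used in the example are hypothetical and are not taken from the specification.

```python
def muladd(regs, ra, rb, rc, rd):
    """Illustrative model of muladd: regs[rd] = regs[ra] * regs[rb] + regs[rc]."""
    product = regs[ra] * regs[rb]   # role of the multiplier 530
    regs[rd] = product + regs[rc]   # role of the adder 532
    return regs[rd]


# Hypothetical example: a 128-entry register file segment with three sources set.
regs = [0] * 128
regs[1], regs[2], regs[3] = 6, 7, 8
result = muladd(regs, ra=1, rb=2, rc=3, rd=4)
print(result)  # 6 * 7 + 8 = 50
```

The point of the sketch is that one subinstruction names four registers at once, which is what drives the register specifier cost discussed next.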

For a register file in which the register file segments include N = 2^M registers, for example, M bits are
used to uniquely specify a particular register so that 4*M bits are needed to uniquely specify the four registers
addressed in a single subinstruction.

FIGURE 5C depicts a subinstruction storage for instructions such as the muladd instruction. Resource 
size and speed constraints are imposed on instruction storage so that the number of bits in a subinstruction are 
limited. The four register specifiers for the subinstruction use nearly the entire capacity of the subinstruction 
storage. For example, a register file segment that includes 128 registers has registers that are uniquely addressed
using seven address bits. Addressing of four registers consumes 7*4 = 28 bits. For a subinstruction size
constrained to 32 bits, only four bits remain for specifying an operation code or other operational information for 
controlling execution. 
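
The bit budget described above can be checked with a short calculation. The sketch below is an assumption-labeled illustration: the helper names `specifier_bits` and `remaining_bits` are hypothetical, and the only figures taken from the text are the 128-register segment, the four register specifiers, and the 32-bit subinstruction.

```python
def specifier_bits(num_registers: int) -> int:
    """Bits needed to uniquely address num_registers registers."""
    return (num_registers - 1).bit_length()


def remaining_bits(subinstruction_bits: int, num_registers: int,
                   num_specifiers: int = 4) -> int:
    """Bits left for the opcode after the register specifiers are encoded."""
    return subinstruction_bits - num_specifiers * specifier_bits(num_registers)


bits_128 = specifier_bits(128)          # 7 address bits
opcode_bits = remaining_bits(32, 128)   # 32 - 4*7 = 4 bits
print(bits_128, opcode_bits)
```

With the same helpers, a flat file of 224 registers would need 8 bits per specifier and leave no opcode bits in a 32-bit subinstruction, which is the pressure that the local/global partitioning relieves.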

The illustrative VLIW processor partitions the register file into local and global registers to conserve 
address bits in a very long instruction word to reduce the size of the register file and accelerate access time. 

Referring to FIGURE 6, a schematic block diagram shows a register file 600 for a VLIW processor 100 
that includes global and local register partitioning. The Very Long Instruction Word (VLIW) processor has a 
plurality of functional units including three media functional units 622, 624, and 626, and a general functional 
unit 620. The processor 100 also includes a multi-ported register file 600 that is divided into a plurality of 
separate register file segments 610, 612, 614, and 616, each of the register file segments being associated to one
of the plurality of functional units. The register file segments 610, 612, 614, and 616 are partitioned into local 
registers and global registers. The global registers are read and written by all functional units 620, 622, 624, and 
626. The local registers are read and written only by a functional unit associated with a particular register file 
segment. The local registers and global registers are addressed using register addresses in an address space that is
separately defined for a register file segment/functional unit pair including register file segment 610/general
functional unit 620, register file segment 612/media functional unit 622, register file segment 614/media
functional unit 624, and register file segment 616/media functional unit 626.

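The global registers are addressed within a selected global register range using the same register
addresses for the plurality of register file segment/functional unit pairs. The local registers in a register file
segment are addressed using register addresses in a local register range outside the global register range that are
assigned within a single register file segment/functional unit pair. Register addresses in the local register range
are the same for the plurality of register file segment/functional unit pairs and address registers locally within a
register file segment/functional unit pair. [...]

[...] internal to the processor 100, the 96 global registers are addressed using address [...]
register file segments. Local registers 96-127 in the register file segment [...]

[...] independent registers is 96 + (4*32) = 224. [...]
space from 0-127, rather than the 8 bits that are otherwise required to access 224 registers.

Global and local register partitioning advantageously leverages the information content of register
specifier bits in an instruction word by inherently communicating information by position dependence within an
instruction group. The positioning of a register specifier in the instruction word thus communicates [...]

in fewer bits than have been specified conventionally.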
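
One way to picture position-dependent addressing is as a mapping from a (segment, 7-bit address) pair to one of the 224 independent registers. The sketch below assumes 96 global and 32 local registers per segment, as in the illustrative embodiment; the flat numbering of independent registers and the name `independent_register` are assumptions made for illustration only.

```python
NUM_GLOBAL = 96      # global registers, shared by all segments
NUM_LOCAL = 32       # local registers per segment
NUM_SEGMENTS = 4     # one segment per functional unit


def independent_register(segment: int, address: int) -> int:
    """Map a 7-bit register address issued in a given subinstruction position
    (segment) to an index among the 96 + 4*32 = 224 independent registers.
    The flat index layout is a hypothetical convention."""
    if not 0 <= segment < NUM_SEGMENTS:
        raise ValueError("segment out of range")
    if not 0 <= address < NUM_GLOBAL + NUM_LOCAL:
        raise ValueError("address out of range")
    if address < NUM_GLOBAL:
        return address                       # same register in every segment
    return NUM_GLOBAL + segment * NUM_LOCAL + (address - NUM_GLOBAL)


print(independent_register(0, 5), independent_register(3, 5))      # same global
print(independent_register(0, 100), independent_register(1, 100))  # distinct locals
```

The segment number never appears in the instruction encoding; it is implied by the position of the subinstruction in the instruction word, which is why the specifier stays at seven bits.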

One address bit is thus saved for each of the four subinstruction positions, a savings of four bits per
[...]
advantageous in a VLIW processor that includes powerful functional units that execute a large plurality of
instructions, each of which is to be encoded in the VLIW instruction word.

In general embodiments, the register file 600 includes N physical registers. The N-register register file
600 is duplicated into M register file segments 610, 612, 614, and 616, each having a reduced number of read
and/or write ports in comparison to a nonduplicated register file, but each having the same number of physical
registers. The register file segments are partitioned into N_G global and N_L local registers, where N_G plus N_L is
equal to N. The register file operates equivalently to a register file having N_G + (M * N_L) total registers
available for the M functional units. The number of address bits for addressing the N_G + (M * N_L) total
registers is equal to the number of bits B that are used to address N = 2^B registers. The local registers for each
of the M register file segments are addressed using the same B-bit values.

In some embodiments, partitioning of the register file 600 is programmable so that the number N_G of
global registers and number N_L of local registers is selectable and variable. For example, a register file including
four register file segments each having 128 registers may be programmably configured as a flat register file with
128 global registers and 0 local registers with the 128 registers addressed using seven address bits. Alternatively,
the four register file segments may be programmably configured, for example, to include 64 global registers and
64 local registers so that the total number of registers is 64 + (4*64) = 320 registers that are again addressed
using 7 bits rather than the 9 bits that would otherwise be required to address 320 registers.
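
The arithmetic behind the programmable partitioning can be reproduced with a small sketch. The helper names `total_registers` and `address_bits` are hypothetical; the configurations are the ones given in the text (128 global with 0 local, 64 global with 64 local) plus the 96/32 split of the illustrative embodiment.

```python
def total_registers(n_global: int, n_local: int, segments: int = 4) -> int:
    """Independent registers available: N_G + (M * N_L)."""
    return n_global + segments * n_local


def address_bits(num_registers: int) -> int:
    """Bits needed to address num_registers registers in a flat space."""
    return (num_registers - 1).bit_length()


configs = [(128, 0), (64, 64), (96, 32)]
summary = []
for n_global, n_local in configs:
    total = total_registers(n_global, n_local)
    # (total registers, bits per specifier as partitioned, bits if flat)
    summary.append((total, address_bits(n_global + n_local), address_bits(total)))

for row in summary:
    print(row)
```

In every configuration the specifier width is fixed by N = N_G + N_L = 128, so it stays at seven bits, while the flat-addressing cost grows with the total register count.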

Referring to FIGURE 7, a schematic block diagram depicts an embodiment of the multiport register file 
216. A plurality of read address buses RA1 through RAN carry read addresses that are applied to decoder ports
816-1 through 816-N, respectively. Decoder circuits are well known to those of ordinary skill in the art, and any
of several implementations could be used as the decoder ports 816-1 through 816-N. When an address is
presented to any of decoder ports 816-1 through 816-N, the address is decoded and a read address signal is
transmitted by a decoder port 816 to a register in a memory cell array 818. Data from the memory cell array 818
is output using output data drivers 822. Data is transferred to and from the memory cell array 818 under control
of control signals carried on some of the lines of the buses of the plurality of read address buses RA1 through
RAN.

Referring to FIGURE 8A and 8B, a schematic block diagram and a pictorial diagram, respectively, 
illustrate the register file 216 and a memory array insert 910. The register file 216 is connected to four
functional units 920, 922, 924, and 926 that supply information for performing operations such as arithmetic,
logical, graphics, data handling operations and the like. The illustrative register file 216 has twelve read ports
930 and four write ports 932. The twelve read ports 930 are illustratively allocated with three ports connected to
each of the four functional units. The four write ports 932 are connected to receive data from all of the four
functional units.

The register file 216 includes a decoder, as is shown in FIGURE 7, for each of the sixteen read and
write ports. The register file 216 includes a memory array 940 that is partially shown in the insert 910 illustrated
in FIGURE 8B and includes a plurality of word lines 944 and bit lines 946. The word lines 944 and bit lines 946
are simply a set of wires that connect transistors (not shown) within the memory array 940. The word lines 944
select registers so that a particular word line selects a register of the register file 216. The bit lines 946 are a
second set of wires that connect the transistors in the memory array 940. Typically, the word lines 944 and bit
lines 946 are laid out at right angles. In the illustrative embodiment, the word lines 944 and the bit lines 946 are
constructed of metal laid out in different planes such as a metal 2 layer for the word lines 944 and a metal 3 layer
for the bit lines 946. In other embodiments, bit lines and word lines may be constructed of other materials, such
as polysilicon, or can reside at different levels than are described in the illustrative embodiment, that are known
in the art of semiconductor manufacture. In the illustrative example, the word lines 944 are separated by a
[...] constructed for various processes. The illustrative example shows one bit line per port; other embodiments
may use multiple bit lines per port.

When a particular functional unit reads a particular register in the register file 216, the functional unit
sends an address signal via the read ports 930 that activates the appropriate word lines to [...]
register file having a conventional structure and twelve read ports, each cell, each storing a single bit of
information, is connected to twelve word lines to select an address and twelve bit lines to carry data read from the
address.

The four write ports 932 address registers in the register file using four word lines 944 [...]
cell.

Thus, if the illustrative register file 216 were laid out in a conventional manner with twelve read ports [...]
have an integrated circuit area of 256 µm² [...]. The area is proportional to the square of the number of ports.

The register file 216 is alternatively implemented to perform single-ended reads and/or single-ended
writes utilizing a single bit line per port per cell, or implemented to perform differential reads and/or differential
writes using two bit lines per port per cell.

However, in this embodiment the register file 216 is not [...]
schematic block diagram shows an arrangement of the register file 216 into the four register file segments 224.
The register file 216 remains operational as a single logical register file in the sense that the four of the register
[...] of the same capacity [...]. The separated register file segments 224 differ from a register file that
[...]
register file segment to the other nine read ports are eliminated. All writes are broadcast so that each of the four
[...] has three read ports and four write ports for a total of seven ports. The individual cells are connected to
[...]
approximately 49 µm². In the illustrative embodiment, the four register file segments 224 have an area
proportional to seven squared. The total area of the four register file segments 224 is therefore proportional to
49 times 4, a total of 196.

The split register file thus advantageously reduces the area of the memory array by a ratio of
approximately 256/196 (1.3X or 30%). The reduction in area further advantageously corresponds to an 
improvement in speed performance due to a reduction in the length of the word lines 944 and the bit lines 946 
connecting the array cells that reduces the time for a signal to pass on the lines. The improvement in speed 
performance is highly advantageous due to strict time budgets that are imposed by the specification of high- 
performance processors and also to attain a large capacity register file that is operational at high speed. For 
example, the operation of reading the register file 216 typically takes place in a single clock cycle. For a 
processor that executes at 500 MHz, a cycle time of two nanoseconds is imposed for accessing the register file 
216. Conventional register files typically only have up to about 32 registers in comparison to the 128 registers in 
the illustrative register file 216 of the processor 100. A register file 216 that is substantially larger than the 
register file in conventional processors is highly advantageous in high-performance operations such as video and 
graphic processing. The reduced size of the register file 216 is highly useful for complying with time budgets in 
a large capacity register file. 
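
The area comparison and the cycle-time figure above follow from simple arithmetic, sketched here under the stated model that cell area is proportional to the square of the number of ports at a line spacing of about 1 µm. The function name `cell_area` and the variable names are hypothetical.

```python
def cell_area(ports: int, pitch_um: float = 1.0) -> float:
    """Cell area in square micrometres for a cell crossed by `ports` word lines
    and `ports` bit lines at the given line spacing."""
    return (ports * pitch_um) ** 2


conventional = cell_area(12 + 4)      # 16 ports: 256 µm² per cell
split_total = 4 * cell_area(3 + 4)    # four segments, 7 ports each: 4 * 49 = 196
ratio = conventional / split_total    # about 1.3

cycle_ns = 1e3 / 500                  # 500 MHz clock gives a 2 ns cycle

print(conventional, split_total, round(ratio, 2), cycle_ns)
```

Read the other way, 196/256 is roughly 0.77, so the split arrangement occupies a little over three quarters of the area of the conventional sixteen-port layout.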

In some embodiments, the area of a register file is further reduced by using a special memory cell for
the local registers that has write port connections only to the functional unit that is locally associated with the
register file segment. Thus the local registers are only written by the local functional unit. In the illustrative register file
216, the special cell for local registers has only a single write port, reducing the number of word lines to four. 
The number of bit lines connected to the local registers is also reduced to four, allowing further compactness to 
the cell. 

Referring to FIGURE 10, a simplified schematic timing diagram illustrates timing of the processor 
pipeline 1100. The pipeline 1100 includes nine stages including three initiating stages, a plurality of execution 
phases, and two terminating stages. The three initiating stages are optimized to include only those operations 
necessary for decoding instructions so that jump and call instructions, which are pervasive in the Java™ 
language, execute quickly. Optimization of the initiating stages advantageously facilitates branch prediction 
since branches, jumps, and calls execute quickly and do not introduce many bubbles. 

The first of the initiating stages is a fetch stage 1110 during which the processor 100 fetches instructions 
from the 16Kbyte two-way set-associative instruction cache 210. The fetched instructions are aligned in the 
instruction aligner 212 and forwarded to the instruction buffer 214 in an align stage 1112, a second stage of the
initiating stages. The aligning operation properly positions the instructions for storage in a particular segment of
the four register file segments 310, 312, 314, and 316 and for execution in an associated functional unit of the
three media functional units 220 and one general functional unit 222. In a third stage, a decoding stage 1114 of
the initiating stages, the fetched and aligned VLIW instruction packet is decoded and the scoreboard (not shown)
is read and updated in parallel. The four register file segments 310, 312, 314, and 316 each hold either
floating-point data or integer data. The register files are read in the decoding (D) stage.

Following the decoding stage 1114, [...]
a trap-handling stage [...] and a write-back stage [...] during which result data is written back to the
split register file 216.

While the invention has been described with reference to various embodiments, it will be understood
that these embodiments are illustrative and that the scope of the invention is not limited to them. Many
variations, modifications, additions and improvements of the embodiments described are possible. For example,
[...]
and can be varied to achieve the desired structure as well as modifications which are within the scope of the
invention. [...] The very long
instruction word may include any suitable number of subinstructions.

Similarly, although the illustrative register file has one bit line per port, in other embodiments more bit
lines may be allocated for a port. The described word lines and bit lines are formed of a metal. In other
examples, other conductive materials such as doped polysilicon may be employed for [...] The
described register file uses single-ended reads and writes so that a single bit line is employed per bit and per port.
In other processors, differential reads and writes with dual-ended sense amplifiers may be used so that two bit
lines are allocated per bit and per port, resulting in a bigger pitch. Dual-ended sense amplifiers improve memory
fidelity but greatly increase the size of a memory array, imposing a heavy burden on speed performance. Thus
the advantages attained by the described register file structure are magnified for a memory using differential
reads and writes. The spacing between bit lines and word lines is described to be approximately 1 µm. In some
[...]

WE CLAIM



1. A processor comprising:

a plurality of functional units; and

a register file characterized in that the register file is divided into a plurality of register file segments,
ones of the plurality of register file segments being coupled to and associated with ones of the
plurality of functional units, the register file segments being partitioned into global registers
and local registers, the global registers that are accessible by the plurality of functional units,
the local registers being accessible by the functional unit associated with the register file
segment containing the local registers.

2. A processor comprising:

a decoder for decoding a very long instruction word including a plurality of subinstructions, the
subinstructions being allocated into positions of the instruction word;

a register file coupled to the decoder and divided into a plurality of register file segments; and

a plurality of functional units, ones of the plurality of functional units being coupled to and associated
with respective ones of the register file segments, ones of the plurality of subinstructions being
executable upon respective ones of the plurality of functional units, operating upon operands
accessible to the register file segment associated with the functional unit of the plurality of
functional units, the register file segments including a plurality of registers that are partitioned
into global registers and local registers, the global registers being accessible by the plurality of
functional units, the local registers in one of the register file segments being accessible by the
functional unit associated with the register file segment.

3. A processor according to either Claim 1 or Claim 2 wherein:

the processor is a Very Long Instruction Word (VLIW) processor.

4. A processor according to Claim 1 or Claim 2 wherein:

the local registers and global registers are addressed using register addresses in an address space that is
defined for a register file segment/functional unit pair.



5. A processor according to Claim 1 or Claim 2 wherein:

the register file is a multi-ported register file.

6. A processor according to Claim 1 or Claim 2 wherein:

the local registers in a register file segment are addressed using register addresses in a local register
range outside the global register range that are assigned within a single register file segment/
functional unit pair.

7. A processor according to Claim 1 or Claim 2 wherein:

register addresses in the local register range are the same for the plurality of register file segment/
functional unit pairs and address registers locally within a register file segment/functional unit
pair.

8. A processor according to Claim 1 or Claim 2 wherein:

the register file includes N physical registers and is duplicated into M register file segments, the register
file segments having a reduced number of read and/or write ports in comparison to a
nonduplicated register file, but each having the same number of physical registers.

9. A processor according to Claim 8 wherein:

the register file segments are partitioned into N_G global and N_L local registers where N_G plus N_L is
equal to N, the register file operating equivalently to a register file having N_G + (M * N_L) total
registers available for the M functional units, the number of address bits for addressing the N_G
+ (M * N_L) total registers being equal to the number of bits B that are used to address N = 2^B
registers, the local registers for ones of the M register file segments are addressed using the
same B-bit values.

10. A processor according to Claim 9 wherein:

[...] N_L of local registers is selectable and variable.

11. A processor according to Claim 1 or Claim 2 wherein the register file is a storage array structure
having R read ports and W write ports comprising:

a plurality of storage array storages; [...]
for the plurality of storage array storages is R read ports; and
the storage array storages having W write ports.

12. A processor according to Claim 11 wherein:

the storage array structure is a sixteen port structure with twelve read ports and five write ports; and
the plurality of storage array storages includes four storage array storages each having three read ports
and five write ports.

13. A processor according to Claim 11 wherein:

the storage array structure is a sixteen port structure with twelve read ports and four write ports; and
the plurality of storage array storages includes four storage array storages each having three read ports
and four write ports.

14. A processor according to Claim 11 wherein:

the writes are fully broadcast so that all of the storage array storages are held coherent.

15. A processor according to Claim 11 wherein:

storage array storages include storage cells having a plurality of word lines and a plurality of bit lines,
the word lines being formed in one metal interconnect layer, the bit lines being formed in a
second metal interconnect layer.

16. A method of operating a processor comprising: 

operating a plurality of functional units; and 

dividing a register file into a plurality of register file segments; 

coupling and associating ones of the plurality of register file segments with ones of the plurality of 
functional units; 

partitioning the register file segments into global registers and local registers; 
accessing the global registers by the plurality of functional units; 

accessing the local registers by the functional unit associated with the register file segment containing 
the local registers. 

17. A method according to Claim 16 further comprising:

addressing the local registers and global registers using register addresses in an address space that is
defined for a register file segment/functional unit pair.

18. A method according to Claim 16 further comprising:

addressing the local registers in a register file segment using register addresses in a local register range
outside the global register range that are assigned within a single register file segment/
functional unit pair.

19. A method according to Claim 16 further comprising:

addressing the local register range the same for the plurality of register file segment/functional unit
pairs and address registers locally within a register file segment/functional unit pair.

20. A method according to Claim 16 further comprising:

including N physical registers in the register file;

duplicating the physical registers into M register file segments, the register file segments having a
reduced number of read and/or write ports in comparison to a nonduplicated register file, but
each having the same number of physical registers.

21. A method according to Claim 20 further comprising:

partitioning the register file segments into N_G global and N_L local register files where N_G plus N_L is
equal to N;
operating the register file equivalently to a register file having N_G + (M * N_L) total registers available for
the M functional units, the number of address bits for addressing the N_G + (M * N_L) total
registers being equal to the number of bits B that are used to address N = 2^B registers; and
addressing the local registers for ones of the M register file segments using the same B-bit values.

22. A method according to Claim 20 further comprising:

programmably partitioning the register file so that the number N_G of global registers and number N_L of
local registers is selectable and variable.



WO 00/33178                                                        PCT/US99/28820

SUBSTITUTE SHEET (RULE 26)

[Drawing sheets 2/7 through 7/7. Only the labels below are recoverable from the drawings.]

FIG. 2: Two processing units, MPU1 (110) and MPU2 (112), each including an instruction cache 210, an
instruction aligner 212, an instruction buffer 214, a PCU 226, functional units MFU3, MFU2, MFU1, and GFU,
register files 216 (segments 224), and a load/store unit 218, coupled to a shared data cache and
synchronization area.

FIG. 3: Register file 216 with broadcast writes (5) to register file segments RF3, RF2, RF1, and RF0
(labels 314 and 310 legible), each segment having 3 read ports.

FIG. 4: Register file 216 (global registers, 12R/4W or 12R/5W) coupled to three MFUs 220 and the GFU 222.

FIG. 5A: Decoder 502; register file segments RF3 516, RF2 514, RF1 512, and RF0 510 coupled to MFU3 526,
MFU2 524, MFU1 522, and GFU 520, respectively.

FIG. 5B: muladd Ra,Rb,Rc,Rd; operands R_A, R_B, R_C; multiplier 530.

FIG. 5C: Subinstruction fields OPCODE, Ra, Rb, Rc, Rd.

FIG. 6: Register file 600 with register file segments 616, 614, 612, and 610, each holding 96 global and 32 local
registers, coupled to MFU3 626, MFU1 622, and GFU 620 (other unit labels not legible); example operations
Fadd, Muladd, ld; label 602.

FIG. 7: Read address buses RA1, RA2, ... RAN; decoder port 816-1; memory cell array 818; output data
drivers 822.

FIG. 8A: Memory array 940; read ports 930; functional units 924 and 926 (other labels not legible).

FIG. 9: [no labels recoverable]


INTERNATIONAL SEARCH REPORT

International Application No: PCT/US 99/28820

A. CLASSIFICATION OF SUBJECT MATTER: IPC 7 G06F9/30

B. FIELDS SEARCHED: IPC 7 G06F

C. DOCUMENTS CONSIDERED TO BE RELEVANT

Page 1 of 2 (categories listed: X, A)

EP 0 767 425 A (DIGITAL EQUIPMENT CORP), 9 April 1997 (1997-04-09)
    column 3, line 31 - line 42
    column 6, line 8 - line 15
    column 10, line 43 - line 48
    column 11, line 19 - line 31

US 4 980 819 A (SHEN JIAN-KUO ET AL), 25 December 1990 (1990-12-25)
    column 2, line 28 - line 52
    column 3, line 45 - line 53
    column 4, line 54 - line 58
    column 5, line 27 - line 60
    column 7, line 10 - column 8, line 8
    column 10, line 3 - line 20
    figure 1

Relevant to claim No. (entries in extracted order): 1,2,4-9,16-21; 10,11,14,22; 1,2,4-7,16-19; 8,9,11,14; 10

Page 2 of 2

US 5 530 817 A (MASUBUCHI YOSHIO), 25 June 1996 (1996-06-25)
    column 3, line 1 - line 18
    Relevant to claim No.: 1-3,16

EP 0 588 341 A (TOYOTA MOTOR CO LTD), 23 March 1994 (1994-03-23)
    page 3, line 54 - page 4, line 9
    Relevant to claim No.: 1,2,7,16,19

EP 0 676 691 A (HEWLETT PACKARD CO; HITACHI LTD (JP)), 11 October 1995 (1995-10-11)
    column 5, line 25 - column 6, line 14
    Relevant to claim No.: 10,22

Date of the actual completion of the international search: 22 March 2000
Date of mailing of the international search report: 30/03/2000
International Searching Authority: European Patent Office
Authorized officer: [name illegible]